January 14, 2025
This class provides an overview of the statistical methods that underpin a number of more advanced machine learning techniques
Class flow:
Part 1: Review of some important topics, regression methods and regularization, first-order optimization methods
Part 2: Neural Networks and Deep Learning, NNs for tabular data, CNNs for images, RNNs and Transformers for sequences
Part 3: Generative Machine Learning, basics of generation methods, PCA and Autoencoders, Autoregressive Models, Generative Adversarial Networks
For this class, I am expecting that students have taken a class on:
Linear regression (Like QTM 220)
Machine Learning (like QTM 347)
Calculus 1-3
Linear algebra
I don’t expect you to remember everything from these classes, but I do expect that when you see certain concepts they aren’t completely foreign.
We’re going to meet here on Tuesdays and Thursdays from 2:30 PM - 3:45 PM
There is no formal attendance requirement for this course
That said, attendance is strongly encouraged (read: basically required)
It's not so much about attending as it is about committing 90 minutes twice a week to this class's materials
During lectures:
I expect that students who attend will be ready to participate (ask questions, answer check-in questions, etc.)
Be willing to stop me if there’s something that isn’t clear. If it’s not clear to you, then I’m sure there’s someone else in class who is also confused.
It’s a relatively small class, so I have no problem spending time re-clarifying points that may not have been well made.
I will also be simulcasting the lectures on Zoom in the class Zoom room posted to Canvas.
However, I will not be recording these lectures
If, for some reason, you need me to record, let me know beforehand and I’ll decide if it is appropriate.
All lecture slides will be posted before class.
I’ll have two office hours periods this semester:
Mondays 4:00 - 5:15ish
Wednesdays from 5:30 - 6:30
All office hours will occur in my office - PAIS 579
Over the course of the semester, there will be 6-8 problem sets. These will account for 50% of your final grade.
Implement and extend the materials discussed in class.
Introduce software and coding
Derivations and Proofs
Problem Sets will be where you get your applied practice
Lectures will largely center on the theory of why things work
Problem sets will relate more to the how
Each problem set will be 2-3 questions
Time required will largely depend on level of comfort with coding!
Really good applied practice to beef up your coding skills.
Problem set solutions should be written up in some form of Markdown
Quarto or a Jupyter Notebook is probably the best for this
Weaves code, text, and LaTeX together seamlessly
All problem sets should be submitted to the appropriate Canvas assignment as two files:
The raw notebook file (.rmd, .qmd, .ipynb)
A rendered version of the notebook (.html or .pdf)
Each problem set will be posted in a variety of different formats, so feel free to use one of those as a template for your final solutions.
All problem sets in this class can be completed in groups of, at most, 3 students
Each student should turn in a copy of the solutions; it can be identical to the group's solutions.
All collaborators must be outlined at the top in the by-line
I recommend the same group each time - repeated vs. one-shot games of trust.
Since it is an upper level course, I’m creating a mechanism for getting rid of shirkers
You can complete assignments individually.
I can also assign you a group, if you’d like. Just email me and I’ll put people together who want a group.
Each student gets one freebie: a problem set can be turned in up to 5 days after the due date with no penalty.
After the freebie, late assignments will receive a 10% per day deduction
There can be exceptions, but these will be hard to receive
Plan ahead accordingly
The other 50% of your final grade will be determined by a final project
A significant project that applies methods discussed in this class
Up to you (and your group) exactly what this is
Do something interesting!
Projects that are too simple will not be accepted
No basic comparisons of methods
Really try to answer a question that’s interesting to you
Three checkpoints:
On March 25th, teams will present a single slide outlining their final project (5%)
On April 25th, teams will present project posters at QTM’s end of semester showcase (20%)
By May 7th, each student should submit a final paper about the final project. A scientific-styled paper no more than 15 pages. (25%)
The goal of this final project is to give you something to include in your portfolio as you apply for jobs or grad school
There aren’t a lot of assignments in this class because I really want you to put a lot of effort into this project!
There will be five textbooks we’ll use in this class:
Probabilistic Machine Learning: An Introduction (PML1)
Probabilistic Machine Learning: Advanced Topics (PML2)
Deep Learning (DL)
Understanding Deep Learning (UDL)
Elements of Statistical Learning (ESL)
Each book is freely available online and posted to Canvas.
I’ll post corresponding chapters for each topic in the weekly modules
I would recommend using lectures as a starting point and reading chapters to fill in gaps
Many topics in these books that we won’t cover!
Modern statistical machine learning is equal parts pen-and-paper and computational implementation.
A large portion of this course revolves around programming and computing.
The lectures are going to talk about this, but the majority of the applications will be on you to figure out in your problem sets
Reading documentation is a skill!
Understanding how to write functions is key. Being able to build an algorithm (even if it’s not elegant) is a useful skill
I’ll be using Python for this class.
You can technically use any language you want for this class
But, I highly recommend Python
Deep learning libraries are developed mostly with Python in mind
My setup:
Python 3.10
VSCode
Jupyter Notebooks/Quarto
My local hardware:
AMD Ryzen 5900x - 12 core/24 thread processor
NVIDIA RTX 3090 Ti - 10,752 CUDA cores/24GB GDDR6X VRAM
Please use GitHub Copilot/ChatGPT-4o!
So many of the annoyances of coding with Python are gone when you use code LLMs
Matplotlib syntax
Documentation searches
scikit-learn nuances
If you have a GitHub account, get it student-verified and you'll be able to use Copilot for free
As we progress through this class, we’re going to get to methods that are computationally demanding
If you don’t have a machine with a discrete GPU, you’ll be using Google Colab
I highly recommend paying $10 a month for Colab Pro during this class
Gives access to better GPUs and priority time on them
If $10 a month presents a problem, let me know.
We’ll be using PyTorch throughout this class
Software for general purpose optimization
The dominant software for deep learning
We'll also use some nice utility functions from PyTorch Lightning
A really nice library that sits on top of PyTorch and organizes code
Automatically detects and uses GPUs and other accelerators when available
The topics I hope to cover this semester are outlined on the syllabus.
The pace is ambitious.
If we need to take more time on topics, we will adjust.
Some stretch goals on GANs and VAEs at the end of the semester. Will replace as needed.
Any questions?
What is a machine learning algorithm?
I really like this definition (Mitchell, 1997):
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
In other words, a machine learning algorithm learns from data - the more data and the more trials, the better
We would hope that the algorithm is able to perform well at a defined task given some data
Machine learning is so popular because it can be used to address many different questions/achieve many different tasks.
Two basic supervised starting points:
Regression - given a feature matrix, \mathbf X, and an outcome vector, \mathbf y \in \mathbb R^N, output a function \hat{f}(\mathbf x) that accurately predicts the value of the outcome given \mathbf x.
What are some questions we can answer with regression algorithms?
Classification - given a feature matrix, \mathbf X, and an outcome vector, \mathbf y, with each y_i \in \{1, 2, \ldots, K\}, output a function \hat{f}(\mathbf x) that accurately predicts the class given \mathbf x.
What are some questions we can answer with classification algorithms?
These two basic tasks describe much of machine learning!
In your introductory machine learning class, the methods discussed largely centered on solving these kinds of problems.
We have a feature matrix (N observations of P features), \mathbf X
We have an outcome vector (N observations of the outcome), \mathbf y
Use \mathbf X and \mathbf y to learn \hat{f}(\mathbf x) that maps features to a predicted outcome.
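As a toy sketch of this recipe (synthetic data; all names here are illustrative), ordinary least squares is the simplest way to learn \hat{f} from \mathbf X and \mathbf y:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations of P features, linear outcome plus noise
N, P = 200, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# "Learn" f_hat: here, ordinary least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def f_hat(x):
    """Predicted outcome for a new feature vector x."""
    return x @ beta_hat
```

Every supervised method in this class follows this shape: use (\mathbf X, \mathbf y) to estimate parameters, then map a new \mathbf x to a prediction.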
However, there are many more interesting tasks that can be solved using machine learning!
Classification with Missing Inputs:
Given a feature matrix, \mathbf X, predict the associated class for a new feature vector, \mathbf x_0.
However, it is not guaranteed that \mathbf X or \mathbf x_0 has all of the features
Let’s suppose that we are attempting to learn a binary class using logistic regression:
Given \mathbf X and \mathbf y, define our hypothesis class as the set of functions covered by:
Pr(y = 1 | \mathbf x) = \frac{\exp[\mathbf x^T \boldsymbol \beta]}{1 + \exp[\mathbf x^T \boldsymbol \beta]}
The goal:
Find \hat{\boldsymbol \beta} that minimizes the cross-entropy generalization error (or log loss) given the training data.
The issue: if \mathbf x_0 has missing values, we can’t assess the function output!
The tools discussed in your intro class can’t handle this!
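To make the problem concrete, here is a minimal logistic regression fit by gradient descent on the average log loss (synthetic data; the learning rate and iteration count are illustrative choices), along with the failure when a new point has a missing feature:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic binary outcome generated from a logistic model
N, P = 500, 2
X = rng.normal(size=(N, P))
beta_true = np.array([2.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

# Fit beta_hat by gradient descent on the average log loss
beta_hat = np.zeros(P)
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(X @ beta_hat)))
    beta_hat -= 0.5 * X.T @ (p_hat - y) / N  # lr = 0.5, illustrative

# The failure mode: a new point with a missing feature can't be scored
x0 = np.array([0.7, np.nan])
p0 = 1 / (1 + np.exp(-(x0 @ beta_hat)))  # nan: NaN propagates through the dot product
```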
A proposal:
Assume that each \mathbf x is a draw from f(\mathbf X) - a P dimensional proper probability density function.
Also, each y is a draw from the proper probability mass function f(y | \mathbf x)
Discriminative models learn the conditional density of y given \mathbf x
Generative models learn the joint density of \mathbf x and y - typically via f(y , \mathbf x) = f(y | \mathbf x)f(\mathbf x).
Remember LDA, QDA, and Naive Bayes?
These models are too simple, though, for many cases!
If we could adequately jointly learn f(\mathbf X) and f(y | \mathbf x), then we could fill in the missing information with the most likely values given what we do observe.
The toolkit you currently have is unequipped to do this in a meaningfully flexible way!
Note that it is really common to have missing inputs
Medical diagnosis - only a few tests are run for each person, but different tests across the population
Survey responses - people aren’t required to fill out every answer
Missing matchups in college basketball - if we want to predict who will win the NCAA tournament, not every pair of teams plays each other in the regular season
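As a toy illustration of the generative idea (under the strong, purely illustrative assumption that f(\mathbf X) is Gaussian), we can fill in a missing feature with its conditional mean given the observed one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic correlated features: "learn" f(X) as a multivariate Gaussian
N = 1000
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=N)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

def impute_first(x):
    """Fill a missing first feature with its conditional mean given the second."""
    # Gaussian conditional: E[x1 | x2] = mu1 + Sigma12 / Sigma22 * (x2 - mu2)
    return mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x[1] - mu[1])

x0 = np.array([np.nan, 1.0])
x0[0] = impute_first(x0)
```

Real problems need far more flexible models of f(\mathbf X) than a Gaussian, which is exactly where this class is headed.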
Structured Inputs/Outputs
Suppose our outcomes of interest are not just a number or a class.
Rather, it’s a sequence of some sort:
Pixels in an image
A sentence
How do we use regression tools to predict an outcome of this sort?
One approach is to develop different functions for each independent output (each pixel in an image, for example)
Each output heavily depends on the others!
Is it possible to learn about many different correlated outcomes at once?
Synthesis
Given a training set, \mathbf X, generate new examples that are similar to those in the training data
Think DALL-E, the image generation software
Or ChatGPT, which provides coherent textual answers to questions that it may not have seen in the training set
These methods are getting really good!
Let’s play a game:
Even music can be convincingly generated using generative methods these days!
Note: This is an at home listen. Also, apologies for the language. But, this is the best AI generated music in the game right now!
A formal statement:
We believe that images/text/audio are generated from a true joint distribution over pixels, \mathbf X
Each example we see, \mathbf x_i, is a sample from this distribution
Goal: Learn f(\mathbf X) in such a way that:
It sufficiently captures complex dependencies between inputs
It can be sampled from to generate new coherent objects
It can be modified to emphasize desired features
There are many other important tasks that machine learning can achieve that don’t exactly fit into the regression or classification bucket:
Machine translation
Anomaly detection
Denoising
Standard density estimation
Many, many more
These tasks tend to:
Have complicated mappings of complicated inputs to complicated outputs
Have high-dimensional feature sets
Involve learning something about the distribution of the features
In other words, many interesting tasks are more complex than tasks that can be addressed via basic machine learning methods!
In your previous classes, you’ve likely covered a number of sophisticated methods.
However, these methods are limited by their rigidity
The types of f(y | \mathbf x) that can be uncovered are limited in complexity
Or the type of y is limited to standard scalar outputs
Or the computational complexity of the method is extremely high
And very few are capable of jointly learning f(\mathbf x) and f(y | \mathbf x)
Linear Regression
\hat{y} = \mathbf x^T \boldsymbol \beta
can only uncover functions that are linear combinations of the input features
We can add nonlinear terms, but we’re still limited to additive combinations
The more features, the more terms we need to compute if we’re adding complexity via this approach
Note that logistic regression is just linear regression with a twist!
Linear Regression
We can add complexity to the linear model a number of different ways:
Global polynomials (curse of dimensionality)
Regularization terms like Ridge and LASSO (creates simpler linear models)
Splines (don’t generalize to high dimensional problems)
None of these approaches really allows us to deal with truly complex structures!
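As a small sketch of two of these fixes combined (synthetic one-dimensional data; the degree and penalty are illustrative choices), here is a global polynomial fit with a ridge penalty:

```python
import numpy as np

rng = np.random.default_rng(3)

# One feature, a nonlinear truth: fit a global polynomial with a ridge penalty
N = 100
x = rng.uniform(-2, 2, size=N)
y = np.sin(x) + rng.normal(scale=0.1, size=N)

degree, lam = 5, 1e-3
X_poly = np.vander(x, degree + 1)  # columns: x^5, ..., x, 1

# Ridge solution: beta = (X^T X + lam * I)^{-1} X^T y
beta = np.linalg.solve(X_poly.T @ X_poly + lam * np.eye(degree + 1),
                       X_poly.T @ y)

y_hat = X_poly @ beta
```

With P features instead of one, a degree-d polynomial needs on the order of \binom{P+d}{d} terms, which is the curse of dimensionality mentioned above.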
Tree-based Methods
\hat{y} = \text{Avg}(y_i : \mathbf x_i \in R)
require no specific functional form, but are limited to scalar y values
Sufficiently flexible in functional form
Unable to handle complex inputs and outputs
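A regression tree is built from splits like the following single-split "stump" (a toy sketch on synthetic step-function data):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic step function: the best split should land near x = 0.5
x = rng.uniform(0, 1, size=200)
y = np.where(x < 0.5, 0.0, 1.0) + rng.normal(scale=0.05, size=200)

# Pick the threshold that minimizes the summed squared error in each region
best_t, best_sse = None, np.inf
for t in np.sort(x):
    left, right = y[x < t], y[x >= t]
    if len(left) == 0 or len(right) == 0:
        continue
    sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    if sse < best_sse:
        best_t, best_sse = t, sse

def predict(x_new):
    """Predict the average outcome of the region x_new falls in."""
    region = y[x < best_t] if x_new < best_t else y[x >= best_t]
    return region.mean()
```

Full trees recurse this split within each region, but the prediction is always a scalar region average, which is the limitation above.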
Kernel-based Approaches
\hat{y} = \sum_{i=1}^N \alpha_i K(\mathbf x, \mathbf x_i)
can uncover any possible functional form with an appropriate kernel (the RBF kernel, for example)
However, computation for kernel based methods can be extremely costly when N is large
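The cost issue is easy to see: the kernel (Gram) matrix has one entry per pair of observations, so it grows quadratically with N. A sketch with the RBF kernel (the bandwidth here is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)

# The kernel matrix is N x N: memory and compute scale quadratically in N
N, P = 1000, 5
X = rng.normal(size=(N, P))

# RBF kernel: K_ij = exp(-gamma * ||x_i - x_j||^2), with gamma = 0.5 here
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)

print(K.shape)  # (1000, 1000): a million entries for only 1,000 points
```

At N = 10^6, the same matrix has 10^12 entries, which is why kernel methods struggle at modern data scales.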
Unfortunately, our toolkit is largely unable to deal with real complex problems!
But, we can get around that by introducing deep learning into the toolkit
A method that allows us to make regression quite flexible
And scalable to large data sets
Based largely around the concept of neural networks:
\hat{y} = \phi(\theta_1\phi(\theta_2(...)))
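That composition can be sketched as a two-layer forward pass (the weights here are random and untrained, purely for illustration; tanh is one common choice of \phi):

```python
import numpy as np

rng = np.random.default_rng(6)

def phi(z):
    """A nonlinear activation, here tanh."""
    return np.tanh(z)

# Two-layer network: y_hat = theta1 . phi(Theta2 @ x)
P, H = 4, 8                       # input features, hidden units
Theta2 = rng.normal(size=(H, P))  # first-layer weights
theta1 = rng.normal(size=H)       # output weights

x = rng.normal(size=P)
y_hat = theta1 @ phi(Theta2 @ x)
```

Stacking more such layers, and learning the weights from data, is what makes the family flexible and scalable.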
Deep learning presents a scalable way to address really complicated problems!
Additionally, encoder/decoder architectures allow us to start thinking about meaningful ways to learn about f(\mathbf x), as well.
Before we explore neural networks, though:
Spend some time reviewing machine learning and predictive modeling basics
Review regression methods and regularization
Discuss convex optimization methods
Taken together, these will give us a good underpinning for really understanding and appreciating how deep learning works!
My thought:
It takes a week or two to learn how to use PyTorch - you could do that on your own
It takes a lot longer to really understand why neural networks and deep learning work the way that they do
A lot of math and statistics
Really appreciate the modern marvel of LLMs and how quickly they can provide coherent answers!
A review of generalization
Underfitting and overfitting
Computing generalization error
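As a preview (synthetic data; the degrees are chosen for illustration), a simple train/test split shows training error falling as polynomial degree grows, while held-out error tells a different story:

```python
import numpy as np

rng = np.random.default_rng(7)

# Nonlinear truth plus noise; hold out the last 20 points for testing
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=60)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

def mse(degree):
    """Training and held-out MSE for a polynomial fit of a given degree."""
    beta = np.polyfit(x_tr, y_tr, degree)
    train = np.mean((np.polyval(beta, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(beta, x_te) - y_te) ** 2)
    return train, test

for d in (1, 5, 15):
    print(d, mse(d))
```

Degree 1 underfits (both errors high), while very high degrees drive training error down without improving held-out error, which is the overfitting story we'll formalize.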